Abstract. When verifying a concurrent program, it is usual to assume 
that memory is sequentially consistent. However, most modern multi- 
processors depend on store buffering for efficiency, and provide native 
sequential consistency only at a substantial performance penalty. To re- 
gain sequential consistency, a programmer has to follow an appropriate 
programming discipline. However, naive disciplines, such as protecting 
all shared accesses with locks, are not flexible enough for building high- 
performance multiprocessor software. 

We present a new discipline for concurrent programming under TSO 
(total store order, with store buffer forwarding). It does not depend on 
concurrency primitives, such as locks. Instead, threads use ghost oper- 
ations to acquire and release ownership of memory addresses. A thread 
can write to an address only if no other thread owns it, and can read from 
an address only if it owns it or it is shared and the thread has flushed 
its store buffer since it last wrote to an address it did not own. This discipline
covers both coarse-grained concurrency (where data is protected by locks) and
fine-grained concurrency (where atomic operations race to memory).

We formalize this discipline in Isabelle/HOL, and prove that if every 
execution of a program in a system without store buffers follows the 
discipline, then every execution of the program with store buffers is se- 
quentially consistent. Thus, we can show sequential consistency under 
TSO by ordinary assertional reasoning about the program, without hav- 
ing to consider store buffers at all. 

1 Introduction 

When verifying a shared-memory concurrent program, it is usual to assume 
that each memory operation works directly on a shared memory state, a model 
sometimes called atomic memory. A memory implementation that provides this 
abstraction for programs that communicate only through shared memory is said 
to be sequentially consistent. Concurrent algorithms in the computing literature 
tacitly assume sequential consistency, as do most application programmers. 

However, modern computing platforms typically do not guarantee sequential 
consistency for arbitrary programs, for two reasons. First, optimizing compilers 
are typically incorrect unless the program is appropriately annotated to indicate
which program locations might be concurrently accessed by other threads; 
this issue is addressed only cursorily in this report. Second, modern processors 
buffer stores of retired instructions. To make such buffering transparent to single- 
processor programs, subsequent reads of the processor read from these buffers 
in preference to the cache. (Otherwise, a program could write a new value to 
an address but later read an older value.) However, in a multiprocessor system, 
processors do not snoop the store buffers of other processors, so a store is visible 
to the storing processor before it is visible to other processors. This can result 
in executions that are not sequentially consistent. 

The simplest example illustrating such an inconsistency is the following
program, consisting of two threads P0 and P1, where x and y are shared memory
variables (initially 0) and r0 and r1 are registers:

P0            P1

x = 1;        y = 1;
r0 = y;       r1 = x;

In a sequentially consistent execution, it is impossible for both r0 and r1 to
be assigned 0. This is because the assignments to x and y must be executed in
some order; if x (resp. y) is assigned first, then r1 (resp. r0) will be set to 1.
However, in the presence of store buffers, the assignments to r0 and r1 might be
performed while the writes to x and y are still in their respective store buffers,
resulting in both r0 and r1 being assigned 0.
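
The following Python sketch illustrates the argument by exhaustively exploring all interleavings of the two-thread program. It is only an illustration under simplifying assumptions: the function `outcomes`, its parameters, and the tuple-based state encoding are hypothetical names introduced here, and the store buffer is modeled as a plain FIFO with forwarding rather than as real hardware.

```python
def outcomes(store_buffers: bool, fence: bool = False) -> set:
    """Return all reachable final (r0, r1) pairs for the two-thread example."""

    def program(write_var, read_var, reg):
        steps = [("write", write_var, 1)]
        if fence:
            steps.append(("fence",))  # optional flush between write and read
        steps.append(("read", read_var, reg))
        return steps

    programs = [program("x", "y", "r0"), program("y", "x", "r1")]
    results, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        memory = dict(mem)
        finished = True
        for i in (0, 1):
            if bufs[i]:
                # Store buffer step: the oldest buffered write reaches memory.
                finished = False
                (var, val), rest = bufs[i][0], bufs[i][1:]
                new_mem = dict(memory)
                new_mem[var] = val
                new_bufs = bufs[:i] + (rest,) + bufs[i + 1:]
                explore(pcs, new_bufs, tuple(sorted(new_mem.items())), regs)
            if pcs[i] < len(programs[i]):
                finished = False
                step = programs[i][pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if step[0] == "write":
                    _, var, val = step
                    if store_buffers:
                        grown = bufs[i] + ((var, val),)
                        explore(new_pcs, bufs[:i] + (grown,) + bufs[i + 1:], mem, regs)
                    else:
                        new_mem = dict(memory)
                        new_mem[var] = val
                        explore(new_pcs, bufs, tuple(sorted(new_mem.items())), regs)
                elif step[0] == "fence":
                    # A fence completes only once the thread's own buffer is empty.
                    if not bufs[i]:
                        explore(new_pcs, bufs, mem, regs)
                else:
                    _, var, reg = step
                    val = memory[var]
                    for bvar, bval in bufs[i]:  # forwarding: newest own write wins
                        if bvar == var:
                            val = bval
                    new_regs = tuple(sorted(dict(regs, **{reg: val}).items()))
                    explore(new_pcs, bufs, mem, new_regs)
        if finished:
            final = dict(regs)
            results.add((final["r0"], final["r1"]))

    explore((0, 0), ((), ()), (("x", 0), ("y", 0)), ())
    return results


if __name__ == "__main__":
    print("sequentially consistent:", sorted(outcomes(store_buffers=False)))
    print("with store buffers:     ", sorted(outcomes(store_buffers=True)))
    print("buffers plus fences:    ", sorted(outcomes(store_buffers=True, fence=True)))
```

In this model the outcome (0, 0) is reachable only in the buffered run without fences; inserting a flush between the write and the read of each thread restores exactly the sequentially consistent outcomes.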

One way to cope with store buffers is to make them an explicit part of the 
programming model. However, this is a substantial programming concession. 
First, because store buffers are FIFO, it ratchets up the complexity of program 
reasoning considerably; for example, the reachability problem for a finite set 
of concurrent finite-state programs over a finite set of finite-valued locations is 
in PSPACE without store buffers, but undecidable (even for two threads) with 
store buffers. Second, because writes from function calls might still be buffered 
when a function returns, making the store buffers explicit would break modular 
program reasoning. 

In practice, the usual remedy for store buffering is adherence to a program- 
ming discipline that provides sequential consistency for a suitable class of archi- 
tectures. In this report, we describe and prove the correctness of such a discipline 
suitable for the memory model provided by existing x86/x64 machines, where 
each write emerging from a store buffer hits a global cache visible to all proces- 
sors. Because each processor sees the same global ordering of writes, this model 
is sometimes called total store order (TSO) [2]. 3 

3 Before 2008, Intel [7] and AMD [1] both put forward a weaker memory model in which 
writes to different memory addresses may be seen in different orders on different 
processors, but respecting causal ordering. However, current implementations satisfy 
the stronger conditions described in this report and are also compliant with the latest 
revisions of the Intel specifications [8]. According to Owens et al. [11] AMD is also 
planning a similar adaptation of their manuals. 
The concurrency discipline most familiar to concurrent programmers is one where 
each variable is protected by a lock, and a thread must hold the corresponding 
lock to access the variable. (It is possible to generalize this to allow shared locks, 
as well as variants such as split semaphores.) Such lock-based techniques are 
typically referred to as coarse-grained concurrency control, and suffice for most 
concurrent application programming. However, these techniques do not suffice 
for low-level system programming (e.g., the construction of OS kernels), for sev- 
eral reasons. First, in kernel programming efficiency is paramount, and atomic 
memory operations are more efficient for many problems. Second, lock-free con- 
currency control can sometimes guarantee stronger correctness (e.g., wait-free 
algorithms can provide bounds on execution time). Third, kernel programming 
requires taking into account the implicit concurrency of concurrent hardware ac- 
tivities (e.g., a hardware TLB racing to use page tables while the kernel is trying 
to access them), and hardware cannot be forced to follow a locking discipline. 

A more refined concurrency control discipline, one that is much closer to 
expert practice, is to classify memory addresses as lock-protected or shared. 
Lock-protected addresses are used in the usual way, but shared addresses can 
be accessed using atomic operations provided by hardware (e.g., on x86 class 
architectures, most reads and writes are atomic 4 ). The main restriction on these 
accesses is that if a processor does a shared write and a subsequent shared read 
(possibly from a different address), the processor must flush the store buffer 
somewhere in between. For example, in the example above, both x and y would 
be shared addresses, so each processor would have to flush its store buffer between 
its first and second operations. 

However, even this discipline is not very satisfactory. First, we would need 
even more rules to allow locks to be created or destroyed, or to change memory 
between shared and protected, and so on. Second, there are many interesting 
concurrency control primitives, and many algorithms, that allow a thread to 
obtain exclusive ownership of a memory address; why should we treat locking as 
special? 

In this report, we consider a much more general and powerful discipline that 
also guarantees sequential consistency. The basic rule for shared addresses is 
similar to the discipline above, but there are no locking primitives. Instead, we 
treat ownership as fundamental. The difference is that ownership is manipulated 
by nonblocking ghost updates, rather than by operations like locking that have 
runtime overhead. Informally the rules of the discipline are as follows: 

— In any state, each memory address is either shared or unshared. Each memory 
address is also either owned by a unique thread or unowned. Every unowned 
address must be shared. Each address is also either read-only or read-write. 
Every unshared address must be read-write. 

— A thread can (autonomously) acquire ownership of an unowned address, or 
release ownership of an address that it owns. It can also change whether an 
address it owns is shared or not. Upon release of an address it can mark it 
as read-only. 

4 This atomicity isn't guaranteed for certain memory types, or for operations that 
cross a cache line. 

— Each memory access is marked as volatile or non-volatile. 

— A thread can perform a write if it is sound. It can perform a read if it is 
sound and clean. 

— A non-volatile access is sound if the thread owns the address and the address 
is unshared. A non-volatile read to a read-only shared address is also sound. 

— A volatile write is sound if no other thread owns the address and the address 
is not marked as read-only. 

— A volatile read is sound if the address is shared or the thread owns it. 

— A read is clean if the store buffer has been flushed since the last volatile 
write. Additionally a non-volatile read is clean if the store buffer has been 
flushed since the address was acquired. 

— For interlocked operations (like compare and swap), which have the side 
effect of the store buffer getting flushed, the rules for volatile accesses apply. 

Note first that these conditions are not thread-local, because some actions 
are allowed only when an address is unowned, marked read-only, or not marked 
read-only. A thread can ascertain such conditions only through system-wide in- 
variants, respected by all threads, along with data it reads. By imposing suitable 
global invariants, various thread-local disciplines (such as one where addresses 
are protected by locks, conditional critical regions, or monitors) can be derived 
as lemmas by ordinary program reasoning, without need for metatheory. 

Second, note that these rules can be checked in the context of a concurrent 
program without store buffers, by introducing ghost state to keep track of own- 
ership, whether the thread has performed a write since the last flush, and which 
owned addresses were acquired since the last flush. Our main result is that if a 
program obeys the rules above, then the program is sequentially consistent when 
executed on a TSO machine. 

Consider our first example program. If we choose to leave both x and y 
shared, then all accesses must be volatile. This would force each thread to flush 
its store buffer between its first and second operations. In practice, on an 
x86/x64 machine, this would be done by making the writes interlocked, which 
flushes store buffers as a side effect. Whichever thread flushes its store buffer 
second is guaranteed to see the write of the other thread, making the execution 
violating sequential consistency impossible. 

However, couldn't the first thread try to take ownership of x before writing 
it, so that its write could be non-volatile? The answer is that it could, but then 
the second thread would be unable to read x with a volatile access (or take ownership of x and 
read it with a non-volatile access), because we would be unable to prove that x is unowned at 
that point. In other words, a thread can take ownership of an address only if it 
is not racing to do so. 

Ultimately, the races allowed by the discipline involve volatile access to a 
shared address, which brings us back to locks. A spinlock is typically imple- 
mented with an interlocked read-modify-write on an address (the interlocking 
providing the required flushing of the store buffer). If the locking succeeds, we 
can prove (using for example a ghost variable giving the ID of the thread taking 
the lock) that no other thread holds the lock, and can therefore safely take own- 
ership of an address "protected" by the lock (using the global invariant that only 
the lock owner can own the protected address). Thus, our discipline subsumes 
the better-known disciplines governing coarse-grained concurrency control. 

Overview. In Section 2 we introduce preliminaries of Isabelle/HOL, the theorem 
prover in which we mechanized our work. In Section 3 we informally describe the 
programming discipline and basic ideas of the formalization, which is detailed 
in Section 4. Finally we conclude in Section 5. 

2 Preliminaries 

The formalization presented in this paper is mechanized and checked within 
the generic interactive theorem prover Isabelle [12]. Isabelle is called generic as 
it provides a framework to formalize various object logics declared via natural 
deduction style inference rules. The object logic that we employ for our formal- 
ization is the higher order logic of Isabelle/HOL [10]. 

This article is written using Isabelle's document generation facilities, which 
guarantees a close correspondence between the presentation and the actual the- 
ory files. We distinguish formal entities typographically from other text. We use 
a sans serif font for types and constants (including functions and predicates), 
e.g., map, a slanted serif font for free variables, e.g., x, and a slanted sans serif 
font for bound variables, e.g., x. Small capitals are used for data type construc- 
tors, e.g., Foo, and type variables have a leading tick, e.g., 'a. HOL keywords 
are typeset in typewriter font, e.g., let. 

To group common premises and to support modular reasoning Isabelle pro- 
vides locales [4,5]. A locale provides a name for a context of fixed parameters 
and premises, together with an elaborate infrastructure to define new locales by 
inheriting and extending other locales, prove theorems within locales and inter- 
pret (instantiate) locales. In our formalization we employ this infrastructure to 
separate the memory system from the programming language semantics. 

The logical and mathematical notions follow the standard notational con- 
ventions with a bias towards functional programming. We only present the more 
unconventional parts here. We prefer curried function application, e.g., f a b 
instead of f(a, b). In this setting the latter becomes a function application to 
one argument, which happens to be a pair. 

Isabelle/HOL provides a library of standard types like Booleans, natural 
numbers, integers, total functions, pairs, lists, and sets. Moreover, there are 
packages to define new data types and records. Isabelle allows polymorphic types, 
e.g., 'a list is the list type with type variable 'a. In HOL all functions are total, 
e.g., nat ⇒ nat is a total function on natural numbers. A function update is 
f(y := v) = λx. if x = y then v else f x. To formalize partial functions the 
type 'a option is used. It is a data type with two constructors, one to inject 
values of the base type, e.g., ⌊x⌋, and the additional element ⊥. A base value 
can be projected with the function the, which is defined by the sole equation 
the ⌊x⌋ = x. Since HOL is a total logic the term the ⊥ is still a well-defined 
yet un(der)specified value. Partial functions are usually represented by the type 
'a ⇒ 'b option, abbreviated as 'a ⇀ 'b. They are commonly used as maps. We 
denote the domain of map m by dom m. A map update is written as m(a ↦ v). 
We can restrict the domain of a map m to a set A by m↾A. 

The syntax and the operations for lists are similar to functional programming 
languages like ML or Haskell. The empty list is [], with x · xs the element x is 
'consed' to the list xs. With xs @ ys list ys is appended to list xs. With the term 
map f xs the function f is applied to all elements in xs. The length of a list is 
|xs|, the n-th element of a list can be selected with xs[n] and can be updated 
via xs[n := v]. With dropWhile P xs the longest prefix for which all elements satisfy 
predicate P is dropped from list xs. 

Sets come along with the standard operations like union, i.e., A ∪ B,
membership, i.e., x ∈ A, and set inversion, i.e., − A. 

Tuples with more than two components are pairs nested to the right. 

3 Programming discipline 

For sequential code on a single processor the store buffer is invisible, since reads 
respect outstanding writes in the buffer. This argument can be extended to 
thread local memory in the context of a multiprocessor architecture. Memory 
typically becomes temporarily thread local by means of locking. The C-idiom 
to identify shared portions of the memory is the volatile tag on variables and 
type declarations. Thread local memory can be accessed non-volatilely, whereas 
accesses to shared memory are tagged as volatile. This prevents the compiler from 
applying certain optimizations to those accesses which could cause undesired 
behavior, e.g., to store intermediate values in registers instead of writing them 
to the memory. 

The basic idea behind the programming discipline is that before gathering 
new information about the shared state (via reading) the thread has to make its 
outstanding changes to the shared state visible to others (by flushing the store 
buffer). This allows us to sequentialize the reads and writes to obtain a sequentially 
consistent execution of the global system. In this sequentialization a write to 
shared memory happens when the write instruction exits the store buffer, and a 
read from the shared memory happens when all preceding writes have exited. 

We distinguish thread local and shared memory by an ownership model. 
Ownership is maintained in ghost state and can be transferred as side effect 
of write operations and by special ghost operations. Every thread has a set of 
owned addresses. Owned addresses of different threads are disjoint. Moreover, 
there is a global set of shared addresses which can additionally be marked as read- 
only. Unowned addresses — addresses owned by no thread — can be accessed 
concurrently by all threads. They are a subset of the shared addresses. The read- 
only addresses are a subset of the unowned addresses. We only allow a thread 
to write to owned addresses and unowned, read-write addresses. We only allow 
a thread to read from owned addresses and from shared addresses (even if they 
are owned by another thread). 

All writes to shared memory have to be volatile. Reads from shared addresses 
also have to be volatile, except if the address is owned (i.e., single writer, multiple 
readers) or if the address is read-only. Moreover, non-volatile writes are restricted 
to owned, unshared memory. As long as a thread owns an address it is guaranteed 
that it is the only one writing to that address. Hence this thread can safely 
perform non-volatile reads to that address without missing any write. Similarly, it 
is safe for any thread to access read-only memory via non-volatile reads since 
there are no outstanding writes at all. 

Recall that a read is clean if it is guaranteed that there is no outstanding 
volatile write (to any address) in the store buffer. Additionally, non-volatile reads 
from addresses that were not freshly acquired since the last flush are considered clean. To 
regain sequential consistency in the presence of store buffers every thread has 
to make sure that every read is clean, by flushing the store buffer when necessary. 
To check the flushing policy of a thread, we keep track of clean reads by means 
of ghost state. For every thread we maintain a dirty flag and a set of acquired 
addresses. Both are reset as the store buffer gets flushed. Upon a volatile write 
the dirty flag is set, and as an address is acquired (by ghost operations) this 
is recorded. The dirty flag and the set of acquired addresses are consulted to 
decide whether a read is clean. 
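
The bookkeeping described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the Isabelle/HOL formalization: the class `FlushMonitor` and its method names are hypothetical, and ownership release (which would also remove addresses from the acquired set) is deliberately left out.

```python
class FlushMonitor:
    """Ghost state of one thread: a dirty flag and the addresses acquired
    since the last store buffer flush (hypothetical helper for illustration)."""

    def __init__(self):
        self.dirty = False
        self.acquired = set()

    def volatile_write(self):
        # A volatile write may still sit in the store buffer afterwards.
        self.dirty = True

    def acquire(self, addresses):
        # Record freshly acquired addresses (done by ghost operations).
        self.acquired |= set(addresses)

    def flush(self):
        # Flushing the store buffer resets both pieces of ghost state.
        self.dirty = False
        self.acquired = set()

    def read_is_clean(self, address, volatile):
        if volatile:
            return not self.dirty
        # Non-volatile reads only matter for freshly acquired addresses.
        return not (address in self.acquired and self.dirty)


if __name__ == "__main__":
    thread = FlushMonitor()
    thread.volatile_write()                            # x = 1 (volatile)
    print(thread.read_is_clean("y", volatile=True))    # a flush is needed first
    thread.flush()
    print(thread.read_is_clean("y", volatile=True))
```

Applied to the introductory example, each thread performs a volatile write followed by a volatile read, so the monitor reports the read as not clean until a flush is recorded in between.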

Table 1a summarizes the access policy and Table 1b the associated flushing 
policy of the programming discipline. The key motivation is to improve perfor- 
mance by minimizing the number of store buffer flushes, while staying sequen- 
tially consistent. The need for flushing the store buffer decreases from interlocked 
accesses (where flushing is a side-effect) over volatile accesses to non-volatile 
accesses. From the viewpoint of access rights there is no difference between in- 
terlocked and volatile accesses. However, keep in mind that some interlocked 
operations can read from, modify and write to an address in a single atomic step 
of the underlying hardware and are typically used in lock-free algorithms or for 
the implementation of locks. 



Table 1: Programming discipline. 

(a) Access policy 

                  shared          shared          unshared
                  (read-write)    (read-only)
owned             vR, vW, R       unreachable     vR, vW, R, W
owned by other    vR              unreachable
unowned           vR, vW          vR, R           unreachable

(v)olatile, (R)ead, (W)rite 
all reads have to be clean 

(b) Flushing policy 

                  flush (before)
interlocked       as side effect
vR, R             if not clean
vW, W             never
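
As a cross-check, the access policy of Table 1a can be recomputed from the access conditions stated for reads and writes in the safety rules (Figure 4). The sketch below is an illustration only: the function `allowed_accesses` and its string encodings of the ownership and sharing status are assumptions introduced here, and cleanliness of reads is ignored.

```python
def allowed_accesses(ownership, sharing):
    """ownership: 'owned' (by the accessing thread), 'other', or 'unowned'.
    sharing: 'rw' (shared read-write), 'ro' (shared read-only), or 'unshared'.
    Returns None for unreachable combinations, else the set of allowed accesses."""
    if sharing == "ro" and ownership != "unowned":
        return None  # read-only addresses are a subset of the unowned addresses
    if sharing == "unshared" and ownership == "unowned":
        return None  # every unowned address must be shared

    owned = ownership == "owned"
    owned_by_other = ownership == "other"
    shared = sharing in ("rw", "ro")
    read_only = sharing == "ro"

    result = set()
    if owned or shared:
        result.add("vR")  # volatile read: owned, read-only, or shared
    if owned or read_only:
        result.add("R")   # non-volatile read: owned or read-only
    if not owned_by_other and not read_only:
        result.add("vW")  # volatile write: no other owner, not read-only
    if owned and not shared:
        result.add("W")   # non-volatile write: owned and unshared
    return result


if __name__ == "__main__":
    for ownership in ("owned", "other", "unowned"):
        for sharing in ("rw", "ro", "unshared"):
            print(ownership, sharing, allowed_accesses(ownership, sharing))
```

For an unshared address owned by another thread the function returns the empty set, since none of the four conditions can hold in that case.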
4 Formalization 

In this section we go into the details of our formalization. In our model, we dis- 
tinguish the plain 'memory system' from the 'programming language semantics' 
which we both describe as a small-step transition relation. During the computa- 
tion of the programming language memory instructions (read / write) are issued 
to the memory system, which itself returns the results in temporary registers. 
This clean interface allows us to parameterize the program semantics over the 
memory system. Our main theorem allows us to simulate a computation step in 
the semantics based on a memory system with store buffers by n steps in the 
semantics based on a sequentially consistent memory system. We refer to the 
former one as store buffer machine and to the latter one as virtual machine. The 
simulation theorem is independent of the programming language. 

We continue with introducing the common parts of both machines. In Section 
4.1 we then describe the virtual machine and in Section 4.2 the store buffer 
machine. Section 4.3 gives some details of our coupling relation which is used for 
the simulation proof presented in Section 4.4. Finally, in Section 4.5 we illustrate 
the integration of a programming language on top of the memory system, by 
presenting PIMP, a concurrent variant of a WHILE language. 

Addresses a, values v and temporaries t are natural numbers. Ghost
annotations for manipulating the ownership information are the following sets of 
addresses: the acquired addresses A, the unshared (local) fraction L of the
acquired addresses, the released addresses R and the writable fraction W of the 
released addresses (the remaining addresses are considered read-only). These 
ownership annotations are considered as side effects on volatile writes and
interlocked operations (in case a write is performed). Moreover, a special ghost 
instruction allows a thread to acquire addresses. The possible status changes of an
address due to these ownership transfer operations are depicted in Figure 1. 

Fig. 1: Ownership transfer 

A memory instruction is a datatype with the following constructors: 
— Read volatile a t for reading from address a to temporary t, where the 
Boolean volatile determines whether the access is volatile or not. 

— Write volatile a sop A L R W to write the result of evaluating the store 
operation sop at address a. A store operation is a pair (D, f), with the 
domain D and the function f. The function f takes temporaries θ as a 
parameter, which maps a temporary to a value. The subset of temporaries 
that is considered by function f is specified by the domain D. We consider 
store operations as valid when they only depend on their domain: 

\[ \textsf{valid-sop}\ sop = \forall D\ f\ \theta.\ sop = (D, f) \land D \subseteq \textsf{dom}\ \theta \longrightarrow f\ \theta = f\ (\theta{\restriction}_D) \]

Again the Boolean volatile specifies the kind of memory access. 

— RMW a t sop cond ret A L R W, for atomic interlocked 'read-modify-write'
instructions (flushing the store buffer). First the value at address a is 
loaded to temporary t, and then the condition cond on the temporaries is 
considered to decide whether a store operation is also executed. In case of a 
store the function ret, depending on both the old value at address a and the 
new value (according to store operation sop), specifies the final result stored 
in temporary t. With a trivial condition cond this instruction also covers 
interlocked reads and writes. 

— Fence, a memory fence that flushes the store buffer. 

— Ghost A L to acquire ownership on addresses. 
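
To make the constructors above more concrete, the following Python sketch mirrors part of the instruction datatype and the validity condition on store operations. It is an illustration under stated assumptions: all class and function names are hypothetical, the RMW constructor is omitted for brevity, and because the validity condition quantifies over all temporary maps, `valid_sop_on` can only test it on a finite list of sample maps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Temps = Dict[int, int]                                   # temporary -> value
StoreOp = Tuple[FrozenSet[int], Callable[[Temps], int]]  # the pair (D, f)


@dataclass(frozen=True)
class Read:
    volatile: bool
    a: int  # address
    t: int  # destination temporary


@dataclass(frozen=True)
class Write:
    volatile: bool
    a: int
    sop: StoreOp
    A: FrozenSet[int] = frozenset()  # acquired addresses
    L: FrozenSet[int] = frozenset()  # local (unshared) part of A
    R: FrozenSet[int] = frozenset()  # released addresses
    W: FrozenSet[int] = frozenset()  # writable part of R


@dataclass(frozen=True)
class Fence:
    pass


@dataclass(frozen=True)
class Ghost:
    A: FrozenSet[int]
    L: FrozenSet[int]


def restrict(theta: Temps, domain: FrozenSet[int]) -> Temps:
    """Restrict the temporary map theta to the given domain."""
    return {t: v for t, v in theta.items() if t in domain}


def valid_sop_on(sop: StoreOp, samples: Iterable[Temps]) -> bool:
    """Check on sample maps that f depends only on the temporaries in D."""
    domain, f = sop
    return all(
        f(theta) == f(restrict(theta, domain))
        for theta in samples
        if set(domain).issubset(theta)  # premise: D is a subset of dom theta
    )


if __name__ == "__main__":
    good = (frozenset({1}), lambda th: th[1] + 1)
    bad = (frozenset({1}), lambda th: th[1] + th.get(2, 0))  # peeks outside D
    samples = [{1: 5}, {1: 5, 2: 7}]
    print(valid_sop_on(good, samples), valid_sop_on(bad, samples))
```

The second store operation reads temporary 2 although its declared domain is {1}, so its result changes when the map is restricted to the domain.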

The configuration of a single thread is a tuple (p, is, θ, sb, 𝒟, 𝒪, 𝒜) consisting 
of the program state p, a memory instruction list is, the map of temporaries θ, 
the store buffer sb, a dirty flag 𝒟 indicating whether there may be an outstanding 
volatile write in the store buffer, the set of owned addresses 𝒪 and finally the 
set of addresses 𝒜 acquired since the last store buffer flush. The dirty flag 𝒟 and 
the set 𝒜 are considered to specify if a read is clean: for all volatile reads and 
the non-volatile reads to addresses in 𝒜 the dirty flag must not be set. 

The type of the program state p and the store buffer sb is free. For example 
we later instantiate the store buffer with the unit type in case of the virtual 
machine or with a list of store buffer instructions in case of the machine with 
store buffer. 

A global configuration (ts, 𝒮, m) consists of a list of thread configurations 
ts, a Boolean map of shared addresses 𝒮 (indicating write permission) and the 
memory m, which is a function from addresses to values. Addresses in the domain 
of mapping 𝒮 are considered shared and the set of read-only addresses is obtained 
from 𝒮 by: read-only 𝒮 = {a. 𝒮 a = ⌊False⌋} 

We describe the computation of the global system by the non-deterministic 
transition relation (ts, 𝒮, m) ⇒ (ts', 𝒮', m') defined in Figure 2. A transition 
selects a thread ts[i] = (p, is, θ, sb, 𝒟, 𝒪, 𝒜) and either the 'program', the
'memory' or the 'store buffer' makes a step. These three sub-relations are parameters 
to the global transition relation. The ownership information stored in the ghost 
components 𝒟, 𝒪, 𝒜, and 𝒮 is sometimes grouped as a single component 𝒢 in 
the transition rules for succinct presentation. 

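\[
\frac{i < |ts| \qquad ts_{[i]} = (p, is, \theta, sb, \mathcal{D}, \mathcal{O}, \mathcal{A}) \qquad \theta \vdash p \rightarrow_{\mathsf{p}} (p', is')}
     {(ts, \mathcal{S}, m) \Rightarrow (ts[i := (p', is \mathbin{@} is', \theta, \mathsf{record}\ p\ p'\ is'\ sb, \mathcal{D}, \mathcal{O}, \mathcal{A})], \mathcal{S}, m)}
\]

\[
\frac{\begin{array}{c}
      i < |ts| \qquad ts_{[i]} = (p, is, \theta, sb, \mathcal{D}, \mathcal{O}, \mathcal{A}) \\
      (is, \theta, sb, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{\mathsf{m}} (is', \theta', sb', m', \mathcal{D}', \mathcal{O}', \mathcal{A}', \mathcal{S}')
      \end{array}}
     {(ts, \mathcal{S}, m) \Rightarrow (ts[i := (p, is', \theta', sb', \mathcal{D}', \mathcal{O}', \mathcal{A}')], \mathcal{S}', m')}
\]

\[
\frac{\begin{array}{c}
      i < |ts| \qquad ts_{[i]} = (p, is, \theta, sb, \mathcal{D}, \mathcal{O}, \mathcal{A}) \\
      (m, sb, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{\mathsf{sb}} (m', sb', \mathcal{O}', \mathcal{A}', \mathcal{S}')
      \end{array}}
     {(ts, \mathcal{S}, m) \Rightarrow (ts[i := (p, is, \theta, sb', \mathcal{D}, \mathcal{O}', \mathcal{A}')], \mathcal{S}', m')}
\]

Fig. 2: Global transitions 

A program step θ ⊢ p →_p (p', is') takes the temporaries θ and the current 
program state p and makes a step by returning a new program state p' and an 
instruction list is' which is appended to the remaining instructions. With the 
functional parameter record we are able to maintain bookkeeping information 
about the program step within the store buffer. It takes the program states p 
and p', the issued instructions is' and the store buffer sb as parameters. This 
is a technical device in our proof which allows us to remember program steps of 
the store buffer machine that are still pending in the virtual machine. 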

A memory step (is, θ, sb, m, 𝒟, 𝒪, 𝒜, 𝒮) →_m (is', θ', sb', m', 𝒟', 𝒪', 𝒜', 
𝒮') of a machine with store buffer may only fill its store buffer. 

In a store buffer step (m, sb, 𝒪, 𝒜, 𝒮) →_sb (m', sb', 𝒪', 𝒜', 𝒮') the store 
buffer may release outstanding instructions to the memory. 

4.1 Virtual machine 

The virtual machine is a sequentially consistent machine without store buffers. 
The transition rules for its memory system are defined in Figure 3. The store 
buffer, which is irrelevant in this transition system, is referenced by x. We
instantiate the global transition system with these rules for the memory system 
and the identity relation for store buffer steps; the program steps are still a 
parameter. We refer to a transition by (ts, 𝒮, m) ⇒^v (ts', 𝒮', m'). 

In addition to the transition rules for the virtual machine we introduce the 
safety judgment 𝒪s, i ⊢ (is, θ, x, m, 𝒟, 𝒪, 𝒜, 𝒮) √ in Figure 4, where 𝒪s is 
the list of ownership sets obtained from the thread list ts and i is the thread 
index. Safety of all reachable states of the virtual machine ensures that the 
access policy is obeyed by the program and is our formal prerequisite for the 
simulation theorem. It is left as a proof obligation to be discharged by means of 
a proper program logic for sequentially consistent executions. In the following we 
elaborate on the rules of Figures 3 and 4 in parallel. To read from an address it 
either has to be owned or read-only or it has to be volatile and shared. Moreover 
the read has to be clean. The memory content of address a is stored in temporary 
t. Non-volatile writes are only allowed to owned and unshared addresses. The 
result is written directly into the memory. A volatile write is only allowed when 
no other thread owns the address and the address is not marked as read-only. 
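\[
(\mathsf{Read}\ volatile\ a\ t \cdot is, \theta, x, m, \mathcal{G}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta(t \mapsto m\ a), x, m, \mathcal{G})
\]

\[
(\mathsf{Write}\ \mathsf{False}\ a\ (D, f)\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{G}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta, x, m(a := f\ \theta), \mathcal{G})
\]

\[
\frac{\mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{True},\ \mathcal{O} \cup A - R,\ \mathcal{A} \cup A - R,\ \mathcal{S} \oplus_W R \ominus_A L)}
     {(\mathsf{Write}\ \mathsf{True}\ a\ (D, f)\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{G}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta, x, m(a := f\ \theta), \mathcal{G}')}
\]

\[
\frac{\lnot\, cond\ (\theta(t \mapsto m\ a)) \qquad \mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{False},\ \mathcal{O},\ \emptyset,\ \mathcal{S})}
     {(\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{G}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta(t \mapsto m\ a), x, m, \mathcal{G}')}
\]

\[
\frac{\begin{array}{c}
      cond\ (\theta(t \mapsto m\ a)) \\
      \theta' = \theta(t \mapsto ret\ (m\ a)\ (f\ (\theta(t \mapsto m\ a)))) \qquad m' = m(a := f\ (\theta(t \mapsto m\ a))) \\
      \mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{False},\ \mathcal{O} \cup A - R,\ \emptyset,\ \mathcal{S} \oplus_W R \ominus_A L)
      \end{array}}
     {(\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{G}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta', x, m', \mathcal{G}')}
\]

\[
(\mathsf{Fence} \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta, x, m, \mathsf{False}, \mathcal{O}, \emptyset, \mathcal{S})
\]

\[
(\mathsf{Ghost}\ A\ L \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow^{\mathsf{v}}_{\mathsf{m}} (is, \theta, x, m, \mathcal{D}, \mathcal{O} \cup A, \mathcal{A} \cup A, \mathcal{S} \ominus_A L)
\]

Fig. 3: Memory transitions of the virtual machine 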

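\[
\frac{\begin{array}{c}
      a \in \mathcal{O} \lor a \in \textsf{read-only}\ \mathcal{S} \lor (volatile \land a \in \textsf{dom}\ \mathcal{S}) \\
      volatile \longrightarrow \lnot \mathcal{D} \qquad \lnot volatile \longrightarrow a \in \mathcal{A} \longrightarrow \lnot \mathcal{D}
      \end{array}}
     {\mathcal{O}s, i \vdash (\mathsf{Read}\ volatile\ a\ t \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

\[
\frac{a \in \mathcal{O} \qquad a \notin \textsf{dom}\ \mathcal{S}}
     {\mathcal{O}s, i \vdash (\mathsf{Write}\ \mathsf{False}\ a\ (D, f)\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

\[
\frac{\begin{array}{c}
      \forall j < |\mathcal{O}s|.\ i \neq j \longrightarrow a \notin \mathcal{O}s_{[j]} \\
      a \notin \textsf{read-only}\ \mathcal{S} \qquad \forall j < |\mathcal{O}s|.\ i \neq j \longrightarrow A \cap \mathcal{O}s_{[j]} = \emptyset \\
      A \subseteq \mathcal{O} \cup \textsf{dom}\ \mathcal{S} \qquad L \subseteq A \qquad R \subseteq \mathcal{O} \qquad A \cap R = \emptyset
      \end{array}}
     {\mathcal{O}s, i \vdash (\mathsf{Write}\ \mathsf{True}\ a\ (D, f)\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

\[
\frac{\lnot\, cond\ (\theta(t \mapsto m\ a)) \qquad a \in \textsf{dom}\ \mathcal{S} \cup \mathcal{O}}
     {\mathcal{O}s, i \vdash (\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

\[
\frac{\begin{array}{c}
      cond\ (\theta(t \mapsto m\ a)) \qquad \forall j < |\mathcal{O}s|.\ i \neq j \longrightarrow a \notin \mathcal{O}s_{[j]} \\
      a \notin \textsf{read-only}\ \mathcal{S} \qquad \forall j < |\mathcal{O}s|.\ i \neq j \longrightarrow A \cap \mathcal{O}s_{[j]} = \emptyset \\
      A \subseteq \mathcal{O} \cup \textsf{dom}\ \mathcal{S} \qquad L \subseteq A \qquad R \subseteq \mathcal{O} \qquad A \cap R = \emptyset
      \end{array}}
     {\mathcal{O}s, i \vdash (\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

\[
\mathcal{O}s, i \vdash (\mathsf{Fence} \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd
\]

\[
\frac{A \subseteq \textsf{dom}\ \mathcal{S} \cup \mathcal{O} \qquad L \subseteq A \qquad \forall j < |\mathcal{O}s|.\ i \neq j \longrightarrow A \cap \mathcal{O}s_{[j]} = \emptyset}
     {\mathcal{O}s, i \vdash (\mathsf{Ghost}\ A\ L \cdot is, \theta, x, m, \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})\ \surd}
\]

Fig. 4: Safe configurations of a virtual machine 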



Simultaneously with the volatile write we can transfer ownership as specified by
the annotations A, L, R and W. The acquired addresses A must not be owned
by any other thread and must either be shared or already owned by the acting
thread. Reacquiring owned addresses can be used to change their shared status
via the set of local addresses L, which has to be a subset of A. The released
addresses R have to be owned and disjoint from the acquired addresses A. After
the write, the new ownership set of the thread is obtained by adding the acquired
addresses A and removing the released addresses R: $\mathcal{O} \cup A - R$.
The set of acquired addresses $\mathcal{A}$ is updated analogously. The released
addresses R are added to the shared addresses $\mathcal{S}$ and the local
addresses L are removed from them. We also take care of the write permissions
in the shared state: a released address is marked writable exactly if it is in
W, and acquired addresses that stay shared are marked writable:
$\mathcal{S} \oplus_W R \ominus_A L$. The auxiliary ternary operators that add
addresses to and remove addresses from the sharing map are defined as follows:

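\[
\mathcal{S} \oplus_W R = \lambda a.\ \textsf{if}\ a \in R\ \textsf{then}\ \lfloor a \in W \rfloor\ \textsf{else}\ \mathcal{S}\ a
\]

\[
\mathcal{S} \ominus_A L = \lambda a.\ \textsf{if}\ a \in L\ \textsf{then}\ \bot\ \textsf{else case}\ \mathcal{S}\ a\ \textsf{of}\ \bot \Rightarrow \bot \mid \lfloor writeable \rfloor \Rightarrow \lfloor a \in A \lor writeable \rfloor
\]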
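The two sharing-map operators and the ghost-state update of a volatile write can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Isabelle/HOL formalization: the sharing map is modeled as a dict from addresses to a writable flag (an absent key stands for the undefined case), addresses are plain integers, and the names `augment_shared`, `restrict_shared` and `volatile_write_ghost` are hypothetical. The sketch reads both set expressions left to right; under the safety conditions (R owned, A and R disjoint, L a subset of A) the order makes no difference.

```python
from typing import Dict, Set

# Sharing map: address -> writable flag. An address missing from the dict is
# unshared, which plays the role of the undefined (bottom) case above.
SharingMap = Dict[int, bool]


def augment_shared(S: SharingMap, W: Set[int], R: Set[int]) -> SharingMap:
    """S (+)_W R: released addresses become shared, writable iff they are in W."""
    result = dict(S)
    for a in R:
        result[a] = a in W
    return result


def restrict_shared(S: SharingMap, A: Set[int], L: Set[int]) -> SharingMap:
    """S (-)_A L: drop local addresses; acquired shared addresses become writable."""
    return {a: (a in A or writable) for a, writable in S.items() if a not in L}


def volatile_write_ghost(O, Acq, S, A, L, R, W):
    """Ghost-state update of a volatile write (the dirty flag is omitted here)."""
    new_owned = (O | A) - R          # O u A - R
    new_acquired = (Acq | A) - R     # acquired set is updated analogously
    new_shared = restrict_shared(augment_shared(S, W, R), A, L)
    return new_owned, new_acquired, new_shared
```

For instance, a thread that owns addresses 3 and 4, acquires the shared read-only address 1 without making it local, and releases 3 as writable ends up owning 1 and 4, while 1 and 3 are both shared and writable.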

The read-modify-write instruction first adds the current value at address a
to temporary t and then checks the store condition cond on the temporaries.
If the check fails, this read is the final result of the operation. Otherwise
the store is performed. The resulting value of the temporary t is specified by
the function ret, which takes both the old and the new value as input. As the
read-modify-write instruction is an interlocked operation that flushes the store
buffer as a side effect, the dirty flag $\mathcal{D}$ and the set of acquired
addresses $\mathcal{A}$ are reset. The other effects on the ghost state and the
safety side conditions are the same as for the volatile read and volatile write,
respectively.

The only effect of the fence instruction in the system without store buffer is 
to reset the dirty flag and the set of acquired addresses. 

The ghost instruction Ghost A L allows a thread to acquire ownership when no
write is involved, i.e., when merely reading from memory. It has the same safety
requirements as the corresponding parts of the write instructions. Releasing
ownership can always be delayed to the next volatile (or interlocked) write
instruction, since only through that write can another thread gain information
about released addresses. In the simulation proof we build on the fact that all
pending ghost operations in the store buffer before the first volatile write may
only acquire addresses into the ownership of the thread.
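The safety side conditions for ghost acquisition and for volatile writes (Figure 4) can be read as executable checks. The following Python sketch is a hedged illustration: the function names are hypothetical, `owns` is a list with one ownership set per thread, the sharing map is a dict from address to writable flag, and `read_only` encodes the assumption that read-only addresses are exactly the shared addresses whose writable flag is false.

```python
from typing import Dict, List, Set


def read_only(S: Dict[int, bool]) -> Set[int]:
    """Assumed meaning of read-only S: shared addresses that are not writable."""
    return {a for a, writable in S.items() if not writable}


def others(owns: List[Set[int]], i: int) -> List[Set[int]]:
    """Ownership sets of all threads except thread i."""
    return [o for j, o in enumerate(owns) if j != i]


def safe_ghost(owns, i, S, A, L) -> bool:
    """Ghost A L: acquire only shared or own addresses, none owned by others."""
    return (A <= (set(S) | owns[i])
            and L <= A
            and all(A.isdisjoint(o) for o in others(owns, i)))


def safe_volatile_write(owns, i, S, a, A, L, R) -> bool:
    """Volatile write to a with ownership annotations A, L, R."""
    return (all(a not in o for o in others(owns, i))
            and a not in read_only(S)
            and safe_ghost(owns, i, S, A, L)   # same requirements as Ghost A L
            and R <= owns[i]
            and A.isdisjoint(R))
```

The volatile-write check reuses `safe_ghost`, mirroring the remark that the ghost instruction has the same requirements as the corresponding parts of the write instructions.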

4.2 Store buffer machine 

The store buffer machine extends the virtual machine by maintaining a list of
outstanding memory writes. Write instructions are appended to the store buffer
and read instructions are satisfied from the store buffer if possible. To support
our coupling relation between a configuration of the store buffer machine and a
configuration of the virtual machine, we also maintain additional bookkeeping
information inside the store buffer. For every write we keep the volatile flag
and the store operation. Moreover we record read, program and ghost steps.
This allows us to restore the necessary computation history of the store buffer
machine and relate it to the virtual machine, which may fall behind the store
buffer machine during execution. Altogether an entry in the store buffer is either
a

– Read_sb volatile a t v, recording a corresponding read from address a which
loaded the value v to temporary t, or a

– Write_sb volatile a sop v for an outstanding write, where operation sop
evaluated to value v, or of the form

– Prog_sb p p' is', recording a program transition from p to p' which issued
instructions is', or of the form

– Ghost_sb A L, recording a corresponding ghost operation to acquire
addresses A and keep addresses L local.

As defined in Figure 5, a write updates the memory when it exits the store
buffer; all other store buffer entries may only have an effect on the ghost
state. The effect on the ownership information is analogous to the corresponding
operations in the virtual machine. The transitions defined in Figure 6 are
straightforward



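\[
(m,\ \mathsf{Write}_{sb}\ \mathsf{False}\ a\ sop\ v\ A\ L\ R\ W \cdot sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{sb} (m(a := v),\ sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S})
\]

\[
\frac{\mathcal{O}' = \mathcal{O} \cup A - R \qquad \mathcal{A}' = \mathcal{A} \cup A - R \qquad \mathcal{S}' = \mathcal{S} \oplus_W R \ominus_A L}
     {(m,\ \mathsf{Write}_{sb}\ \mathsf{True}\ a\ sop\ v\ A\ L\ R\ W \cdot sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{sb} (m(a := v),\ sb,\ \mathcal{O}', \mathcal{A}', \mathcal{S}')}
\]

\[
(m,\ \mathsf{Read}_{sb}\ volatile\ a\ t\ v \cdot sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{sb} (m,\ sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S})
\]

\[
(m,\ \mathsf{Prog}_{sb}\ p\ p'\ is \cdot sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{sb} (m,\ sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S})
\]

\[
(m,\ \mathsf{Ghost}_{sb}\ A\ L \cdot sb,\ \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_{sb} (m,\ sb,\ \mathcal{O} \cup A,\ \mathcal{A} \cup A,\ \mathcal{S} \ominus_A L)
\]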

Fig. 5: Store buffer transitions 

extensions of the transitions of the virtual machine. With buffered-val sb a we
obtain the value of the last write to address a that is still pending in the
store buffer. If no write to a is outstanding in the store buffer, we read from
memory. Store operations have no immediate effect on the memory but
are queued in the store buffer instead. This also includes their effect on the
ownership information. Interlocked operations and the fence operation require
an empty store buffer, which means that it has to be flushed before the action
can take place. We instantiate the global transition system with the rules of
Figures 5 and 6. The program transitions are still a parameter. We refer to a
transition of the store buffer machine by
$(ts_{sb}, \mathcal{S}_{sb}, m_{sb}) \Rightarrow_{sb} (ts_{sb}', \mathcal{S}_{sb}', m_{sb}')$.
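The interplay of buffering, store buffer forwarding and flushing can be illustrated with a small Python sketch. It is a simplification under stated assumptions: only read and write entries are modeled, ownership annotations and ghost state are left out, memory and temporaries are dicts, and the names `WriteSb`, `ReadSb`, `buffered_val`, `read`, `write` and `flush_one` are hypothetical stand-ins for the entries and rules of Figures 5 and 6.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class WriteSb:
    volatile: bool
    addr: int
    value: int


@dataclass
class ReadSb:
    volatile: bool
    addr: int
    tmp: int
    value: int


Entry = Union[WriteSb, ReadSb]


def buffered_val(sb: List[Entry], addr: int) -> Optional[int]:
    """Value of the last pending write to addr, or None if there is none."""
    for entry in reversed(sb):
        if isinstance(entry, WriteSb) and entry.addr == addr:
            return entry.value
    return None


def read(sb, mem, tmps, volatile, addr, tmp) -> Tuple[List[Entry], Dict[int, int]]:
    """Read with forwarding: prefer the store buffer, fall back to memory."""
    pending = buffered_val(sb, addr)
    value = mem[addr] if pending is None else pending
    return sb + [ReadSb(volatile, addr, tmp, value)], {**tmps, tmp: value}


def write(sb, volatile, addr, value) -> List[Entry]:
    """A write is only queued; memory is untouched until the entry is flushed."""
    return sb + [WriteSb(volatile, addr, value)]


def flush_one(sb, mem) -> Tuple[List[Entry], Dict[int, int]]:
    """The oldest entry leaves the buffer; only writes change memory."""
    head, rest = sb[0], sb[1:]
    if isinstance(head, WriteSb):
        mem = {**mem, head.addr: head.value}
    return rest, mem
```

A thread that writes 5 to address 0 and then reads address 0 sees 5 through forwarding, although memory still holds the old value until `flush_one` retires the write.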
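\[
\frac{v = (\textsf{case}\ \textsf{buffered-val}\ sb\ a\ \textsf{of}\ \bot \Rightarrow m\ a \mid \lfloor v' \rfloor \Rightarrow v') \qquad sb' = sb\ @\ [\mathsf{Read}_{sb}\ volatile\ a\ t\ v]}
     {(\mathsf{Read}\ volatile\ a\ t \cdot is,\ \vartheta,\ sb,\ m,\ \mathcal{G}) \rightarrow_m (is,\ \vartheta(t \mapsto v),\ sb',\ m,\ \mathcal{G})}
\]

\[
\frac{sb' = sb\ @\ [\mathsf{Write}_{sb}\ \mathsf{False}\ a\ (D, f)\ (f\ \vartheta)\ A\ L\ R\ W]}
     {(\mathsf{Write}\ \mathsf{False}\ a\ (D, f)\ A\ L\ R\ W \cdot is,\ \vartheta,\ sb,\ m,\ \mathcal{G}) \rightarrow_m (is,\ \vartheta,\ sb',\ m,\ \mathcal{G})}
\]

\[
\frac{sb' = sb\ @\ [\mathsf{Write}_{sb}\ \mathsf{True}\ a\ (D, f)\ (f\ \vartheta)\ A\ L\ R\ W] \qquad \mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{True}, \mathcal{O}, \mathcal{A}, \mathcal{S})}
     {(\mathsf{Write}\ \mathsf{True}\ a\ (D, f)\ A\ L\ R\ W \cdot is,\ \vartheta,\ sb,\ m,\ \mathcal{G}) \rightarrow_m (is,\ \vartheta,\ sb',\ m,\ \mathcal{G}')}
\]

\[
\frac{\lnot\, cond\ (\vartheta(t \mapsto m\ a)) \qquad \mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{False}, \mathcal{O}, \emptyset, \mathcal{S})}
     {(\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is,\ \vartheta,\ [],\ m,\ \mathcal{G}) \rightarrow_m (is,\ \vartheta(t \mapsto m\ a),\ [],\ m,\ \mathcal{G}')}
\]

\[
\frac{\begin{array}{c}
cond\ (\vartheta(t \mapsto m\ a)) \\
\vartheta' = \vartheta(t \mapsto ret\ (m\ a)\ (f\ (\vartheta(t \mapsto m\ a)))) \qquad m' = m(a := f\ (\vartheta(t \mapsto m\ a))) \\
\mathcal{G} = (\mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \qquad \mathcal{G}' = (\mathsf{False},\ \mathcal{O} \cup A - R,\ \emptyset,\ \mathcal{S} \oplus_W R \ominus_A L)
\end{array}}
     {(\mathsf{RMW}\ a\ t\ (D, f)\ cond\ ret\ A\ L\ R\ W \cdot is,\ \vartheta,\ [],\ m,\ \mathcal{G}) \rightarrow_m (is,\ \vartheta',\ [],\ m',\ \mathcal{G}')}
\]

\[
(\mathsf{Fence} \cdot is,\ \vartheta,\ [],\ m,\ \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_m (is,\ \vartheta,\ [],\ m,\ \mathsf{False}, \mathcal{O}, \emptyset, \mathcal{S})
\]

\[
(\mathsf{Ghost}\ A\ L \cdot is,\ \vartheta,\ sb,\ m,\ \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S}) \rightarrow_m (is,\ \vartheta,\ sb\ @\ [\mathsf{Ghost}_{sb}\ A\ L],\ m,\ \mathcal{D}, \mathcal{O}, \mathcal{A}, \mathcal{S})
\]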

Fig. 6: Memory transitions of store buffer machine 
4.3 Coupling relation



In this section we establish the coupling relation between a configuration of a
machine with store buffer and the virtual machine without store buffer. It allows
us to simulate every computation step of the store buffer machine by a
(potentially empty) sequence of steps on the virtual machine. This transformation
is essentially a sequentialization of the trace of the store buffer machine. When
a thread of the store buffer machine executes a non-volatile operation, it only
accesses memory that is not modified by any other thread (it is either owned or
read-only). Although a non-volatile store is buffered, we can immediately execute
it on the virtual machine, as there is no competing store of another thread. The
same is true for reads, which get recorded in the store buffer. However, with
volatile writes we have to be careful, since concurrent threads may compete with
a volatile write to the same address. At the moment the volatile write enters
the store buffer we do not yet know when it will be issued to memory and how it
is ordered relative to outstanding writes of other threads. We therefore
suspend the write on the virtual machine from the moment it enters the store
buffer to the moment it is issued to memory. For volatile reads our access policy
guarantees that there is no volatile write in the store buffer, by flushing the
store buffer if necessary. So there are at most some outstanding non-volatile
writes in the store buffer, which are already executed on the virtual machine, as
described before. Altogether this suggests the following coupling relation: the
memory of the virtual machine is obtained from the memory of the store buffer
machine by flushing every store buffer until we reach a volatile write. The
remaining store buffer entries are suspended as instructions. The suspended reads
are not yet visible in the temporaries of the virtual machine. Similarly, the
ownership effects of the suspended ghost operations are not yet visible in the
virtual machine.

Consider the following configuration of a thread $ts_{sb}[j]$ in the store buffer
machine, where $i_k$ are the instructions and $s_k$ the store buffer entries. Let
$s_v$ be the first volatile write in the store buffer. Keep in mind that new
store buffer entries are appended to the end of the list, and that entries exit
the store buffer and are issued to memory from the front of the list.

\[
ts_{sb}[j] = (p,\ [i_1, \dots, i_n],\ \vartheta,\ [s_1, \dots, s_v, s_{v+1}, \dots, s_m],\ \mathcal{D}, \mathcal{O}, \mathcal{A})
\]

The corresponding configuration $ts[j]$ in the virtual machine is obtained by
suspending all store buffer entries beginning at $s_v$ to the front of the
instructions. A store buffer Read_sb / Write_sb / Ghost_sb entry is converted to
a Read / Write / Ghost instruction. We take the freedom to make this coercion
implicit in the example. The store buffer entries preceding $s_v$ have already
made their way to memory, whereas the suspended read operations are not yet
visible in the temporaries $\vartheta'$. Similarly, the suspended updates to the
ownership sets and dirty flag are not yet recorded in $\mathcal{O}'$,
$\mathcal{A}'$ and $\mathcal{D}'$.

\[
ts[j] = (p,\ [s_v, s_{v+1}, \dots, s_m, i_1, \dots, i_n],\ \vartheta',\ (),\ \mathcal{D}', \mathcal{O}', \mathcal{A}')
\]

This example illustrates that the virtual machine falls behind the store buffer
machine in our simulation, as store buffer instructions are suspended and reads
(and ghost operations) are delayed and not yet visible in the temporaries (and
the ghost state). This delay can also propagate to the level of the programming
language, which communicates with the memory system by reading the temporaries
and issuing new instructions. For example, the control flow can depend on
the temporaries, which store the result of branching conditions. It may happen
that the store buffer machine has already evaluated the branching condition by
referring to the values in the store buffer, whereas the virtual machine still has
to wait. Formally this manifests in still undefined temporaries. Now consider that
the program in the store buffer machine makes a step from p to (p', is'), which
results in a thread configuration where the program state has switched to p',
the instructions is' are appended and the program step is recorded in the store
buffer:

\[
ts_{sb}'[j] = (p',\ [i_1, \dots, i_n]\ @\ is',\ \vartheta,\ [s_1, \dots, s_v, \dots, s_m,\ \mathsf{Prog}_{sb}\ p\ p'\ is'],\ \mathcal{D}, \mathcal{O}, \mathcal{A})
\]

The virtual machine, however, makes no step, since it still has to evaluate the
suspended instructions before making the program step. The instructions is' are
not yet issued and the program state is still p. We also take these program
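steps into account in our final coupling relation
$(ts_{sb}, \mathcal{S}_{sb}, m_{sb}) \sim (ts, \mathcal{S}, m)$, defined in Figure 7.

\[
\frac{\begin{array}{l}
m = \textsf{flush-all-until-volatile-write}\ ts_{sb}\ m_{sb} \\
\mathcal{S} = \textsf{share-all-until-volatile-write}\ ts_{sb}\ \mathcal{S}_{sb} \qquad |ts_{sb}| = |ts| \\
\forall j < |ts_{sb}|. \\
\quad \textsf{let}\ (p, is_{sb}, \vartheta_{sb}, sb, \mathcal{D}_{sb}, \mathcal{O}, \mathcal{A}) = ts_{sb}[j]; \\
\quad\quad flushs = \textsf{takeWhile}\ \textsf{not-volatile-write}\ sb; \\
\quad\quad suspends = \textsf{dropWhile}\ \textsf{not-volatile-write}\ sb \\
\quad \textsf{in}\ \exists is\ \mathcal{D}.\ \textsf{instrs}\ suspends\ @\ is_{sb} = is\ @\ \textsf{prog-instrs}\ suspends\ \land \\
\quad\quad \mathcal{D}_{sb} = (\mathcal{D} \lor \textsf{refs}\ \textsf{volatile-Write}\ sb \neq \emptyset)\ \land \\
\quad\quad ts[j] = (\textsf{hd-prog}\ p\ suspends,\ is,\ \vartheta_{sb}|_{(-\,\textsf{read-tmps}\ suspends)},\ (),\ \mathcal{D},\ \textsf{acquire}\ flushs\ \mathcal{O},\ \textsf{acquire}\ flushs\ \mathcal{A})
\end{array}}
{(ts_{sb}, \mathcal{S}_{sb}, m_{sb}) \sim (ts, \mathcal{S}, m)}
\]

Fig. 7: Coupling relation

We denote the already simulated (flushed) store buffer entries by flushs and the
suspended ones by suspends. The function instrs converts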
them back to instructions, which are a prefix of the instructions of the virtual
machine. We collect the additional instructions that were issued by program
steps but are still recorded in the remainder of the store buffer with function
prog-instrs. These instructions have already made their way to the instructions
of the store buffer machine but not yet to the virtual machine. This situation
is formalized as instrs suspends @ $is_{sb}$ = is @ prog-instrs suspends, where
is are the instructions of the virtual machine. The program state of the virtual
machine is either the same as in the store buffer machine or the first program
state recorded in the suspended part of the store buffer. This state is selected
by hd-prog. The temporaries of the virtual machine are obtained by removing the
suspended reads from $\vartheta_{sb}$. The memory is obtained by flushing all
store buffers until the first volatile write is hit, excluding it. Thereby only
non-volatile writes are flushed, which are all thread local, and hence could be
flushed in any order with the same result on the memory. Function
flush-all-until-volatile-write flushes them in order of appearance. Similarly the
sharing map of the virtual machine is obtained by flushing all store buffers
until the first volatile write via the function share-all-until-volatile-write.
For the local ownership sets $\mathcal{O}$ and $\mathcal{A}$ the auxiliary
function acquire calculates the outstanding effect of the already simulated parts
of the store buffer.
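The memory component of the coupling relation can be sketched in Python. This is a simplified illustration, not the formal definition: store buffer entries are modeled as tuples whose first field is a tag, writes are assumed to have the shape ("write", volatile, addr, value), ghost state and sharing are ignored, and the function names merely echo those used in Figure 7.

```python
from itertools import dropwhile, takewhile


def not_volatile_write(entry) -> bool:
    """True for every entry except a volatile write."""
    return not (entry[0] == "write" and entry[1])


def split_buffer(sb):
    """Split a store buffer into the simulated prefix and the suspended rest."""
    flushs = list(takewhile(not_volatile_write, sb))
    suspends = list(dropwhile(not_volatile_write, sb))
    return flushs, suspends


def flush_all_until_volatile_write(buffers, mem):
    """Virtual memory: apply each buffer's writes before its first volatile write."""
    mem = dict(mem)
    for sb in buffers:
        flushs, _ = split_buffer(sb)
        for entry in flushs:
            if entry[0] == "write":
                _, _, addr, value = entry
                mem[addr] = value
    return mem
```

Because the flushed writes are all non-volatile and hence thread local, iterating over the buffers in any other order would produce the same memory; the sketch simply processes them in order of appearance.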

One may think of simplifying the coupling relation by avoiding flushing al- 
together and just suspending the whole store buffer. However, consider the fol- 
lowing scenario. A thread is reading from a volatile address. It can still have 
non-volatile writes in its store buffer. Hence the read would be suspended, and 
we could miss updates made by other threads to this address. 

4.4 Simulation 

Theorem 1 is our core simulation theorem. Provided that all reachable states of
the virtual machine are safe, a step of the store buffer machine can be simulated
by a (potentially empty) sequence of steps on the virtual machine, maintaining
the coupling relation and an invariant on the configurations of the store buffer
machine.

Theorem 1 (Simulation). 

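\[
\begin{array}{l}
(ts_{sb}, \mathcal{S}_{sb}, m_{sb}) \Rightarrow_{sb} (ts_{sb}', \mathcal{S}_{sb}', m_{sb}') \land (ts_{sb}, \mathcal{S}_{sb}, m_{sb}) \sim (ts, \mathcal{S}, m)\ \land \\
\textsf{safe-reach}\ (ts, \mathcal{S}, m) \land \textsf{invariant}\ ts_{sb}\ \mathcal{S}_{sb}\ m_{sb} \longrightarrow \\
\quad \textsf{invariant}\ ts_{sb}'\ \mathcal{S}_{sb}'\ m_{sb}'\ \land \\
\quad (\exists\, ts'\ \mathcal{S}'\ m'.\ (ts, \mathcal{S}, m) \Rightarrow^{*} (ts', \mathcal{S}', m') \land (ts_{sb}', \mathcal{S}_{sb}', m_{sb}') \sim (ts', \mathcal{S}', m'))
\end{array}
\]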

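In the following we discuss the invariant
$\textsf{invariant}\ ts_{sb}\ \mathcal{S}_{sb}\ m_{sb}$, where we commonly
refer to a thread configuration
$ts_{sb}[i] = (p, is, \vartheta, sb, \mathcal{D}, \mathcal{O}, \mathcal{A})$ for
$i < |ts_{sb}|$. By outstanding references we refer to read and write operations
in the store buffer. The invariant is a conjunction of several sub-invariants
grouped by their content:

\[
\begin{array}{l}
\textsf{invariant}\ ts_{sb}\ \mathcal{S}_{sb}\ m_{sb} = \textsf{ownership-inv}\ \mathcal{S}_{sb}\ ts_{sb} \land \textsf{sharing-inv}\ \mathcal{S}_{sb}\ ts_{sb}\ \land \\
\quad \textsf{temporaries-inv}\ ts_{sb} \land \textsf{data-dependency-inv}\ ts_{sb} \land \textsf{history-inv}\ ts_{sb}\ m_{sb}\ \land \\
\quad \textsf{flush-inv}\ ts_{sb} \land \textsf{valid}\ ts_{sb}
\end{array}
\]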

Ownership. (i) For every thread all outstanding non-volatile references have to
be owned or refer to read-only memory. (ii) Every outstanding volatile write is
not owned by any other thread. (iii) Outstanding accesses to read-only memory
are not owned. (iv) The ownership sets of every two different threads are disjoint.

Sharing. (i) All outstanding non-volatile writes are unshared. (ii) All unowned
addresses are shared. (iii) No thread owns read-only memory. (iv) The ownership
annotations of outstanding ghost and write operations are consistent (e.g.,
released addresses are owned at the point of release). (v) There is no outstanding
write to read-only memory.

Temporaries. Temporaries are modeled as an unlimited store for temporary
registers. We require certain distinctness and freshness properties for each
thread. (i) The temporaries referred to by read instructions are distinct.
(ii) The temporaries referred to by reads in the store buffer are distinct.
(iii) Read and write temporaries are distinct. (iv) Read temporaries are fresh,
i.e., are not in the domain of $\vartheta$.

Data dependency. Data dependency means that store operations may only depend
on previous read operations. For every thread we have: (i) Every operation
(D, f) in a write instruction or a store buffer write is valid according to
valid-sop (D, f), i.e., function f only depends on domain D. (ii) For every
suffix of the instructions of the form Write volatile a (D, f) A L R W · is, the
domain D is distinct from the temporaries referred to by future read instructions
in is. (iii) The outstanding writes in the store buffer do not depend on the read
temporaries still in the instruction list.

History. The history information of program steps and read operations that we
record in the store buffer has to be consistent with the trace. For every thread:
(i) The value stored for a non-volatile read is the same as the last write to the
same address in the store buffer, or the value in memory in case there is no write
in the buffer. (ii) All reads have to be clean. This results from our flushing
policy. Note that the value recorded for a volatile (and acquired non-volatile)
read in the initial part of the store buffer (before the first volatile write)
may become stale with respect to the memory. Remember that those parts of the
store buffer are already flushed in the virtual machine and thus cause no trouble.
(iii) For every read the recorded value coincides with the corresponding value in
the temporaries. (iv) For every Write_sb volatile a (D, f) v A L R W the recorded
value v coincides with $f\ \vartheta$, and domain D is a subset of
$\textsf{dom}\ \vartheta$ and is distinct from the following read temporaries.
Note that the consistency of the ownership annotations is already covered by the
aforementioned invariants. (v) For every suffix in the store buffer of the form
Prog_sb $p_1$ $p_2$ is' · sb', either $p_1 = p$ in case there is no preceding
program node in the buffer, or it corresponds to the last program state recorded
there. Moreover, the program transition
$\vartheta|_{(-\,\textsf{read-tmps}\ sb')} \vdash p_1 \rightarrow_p (p_2, is')$
is possible, i.e., it was possible to execute the program transition at that
point. (vi) The program configuration p coincides with the last program
configuration recorded in the store buffer. (vii) As the instructions from a
program step are on the one hand appended to the instruction list and on the
other hand recorded in the store buffer, we have for every suffix sb' of the
store buffer: $\exists\, is'.\ \textsf{instrs}\ sb'\ @\ is = is'\ @\ \textsf{prog-instrs}\ sb'$,
i.e., the remaining instructions is correspond to a suffix of the recorded
instructions prog-instrs sb'.
Flushes. If the dirty flag is unset there are no outstanding volatile writes in the
store buffer.

Program step. The program transitions are still a parameter of our model. In
order to make the proof work, we have to assume some of the invariants also
for the program steps. We allow the program transitions to employ further
invariants on the configurations; these are modeled by the parameter valid. For
example, in the instantiation later on the program keeps a counter for the
temporaries, for each thread. We maintain distinctness of temporaries by
restricting all temporaries occurring in the memory system to be below that
counter, which is expressed by instantiating valid. Program steps, memory steps
and store buffer steps have to maintain valid. Furthermore we assume the following
properties of a program step: (i) The program step generates fresh, distinct read
temporaries that are neither in $\vartheta$ nor in the store buffer temporaries of
the memory system. (ii) The generated memory instructions respect data
dependencies and are valid according to valid-sop.

Proof. We do not go into details but rather sketch the main arguments for
simulation of a step in the store buffer machine by a potentially empty sequence
of steps in the virtual machine, maintaining the coupling relation. The first case
distinction in the proof is on the global transitions in Figure 2. (i) Program step:
we make a case distinction on whether there is an outstanding volatile write in
the store buffer or not. If not, the configuration of the virtual machine
corresponds to the flushed store buffer and we can make the same step. Otherwise
the virtual machine makes no step, as we have to wait until all volatile writes
have exited the store buffer. (ii) Memory step: we do a case distinction on the
rules in Figure 6. For read, non-volatile write and ghost instructions we do the
same case distinction as for the program step. If there is no outstanding volatile
write in the store buffer we can make the step, otherwise we have to wait. When a
volatile write enters the store buffer it is suspended until it exits the store
buffer. Hence we do no step in the virtual machine. The read-modify-write and the
fence instruction can all be simulated immediately since the store buffer has to
be empty. (iii) Store buffer step: we do a case distinction on the rules in
Figure 5. When a read, a non-volatile write, a ghost operation or a program
history node exits the store buffer, the virtual machine does not have to do any
step since these steps are already visible. When a volatile write exits the store
buffer, we execute all the suspended operations (including reads, ghost operations
and program steps) until the next suspended volatile write is hit. This is
possible since all writes in between are non-volatile and thus their memory
modifications are thread local.

A common argument in various places in the proof is to rule out potential
races by constructing computations of the virtual machine that lead to an unsafe
state and are thus unreachable in a safe execution. Here we make use of the fact
that the ghost operations in the prefixes of the store buffers that are already
simulated in the virtual machine may only acquire new addresses into the
ownership of a thread but not release addresses. From the viewpoint of other
threads this may only lead to a more restrictive configuration but never to a more
liberal one, making the construction of an unsafe execution of the virtual machine
possible without referring to an older state than the current state of the store
buffer machine.



4.5 PIMP 

PIMP is a parallel version of IMP [9], a canonical WHILE-language.

An expression e is either (i) Const v, a constant value, (ii) Mem volatile
a, a (volatile) memory lookup at address a, (iii) Tmp sop, reading from the
temporaries with an operation sop, which is an intermediate expression occurring
in the transition rules for statements, (iv) Unop f e, a unary operation where f
is a unary function on values, and finally (v) Binop f $e_1$ $e_2$, a binary
operation where f is a binary function on values.

A statement s is either (i) Skip, the empty statement, (ii) Assign volatile
a e A L R W, a (volatile) assignment of expression e to address expression a,
(iii) CAS a $c_e$ $s_e$ A L R W, atomic compare and swap at address expression a
with compare expression $c_e$ and swap expression $s_e$, (iv) Seq $s_1$ $s_2$,
sequential composition, (v) Cond e $s_1$ $s_2$, the if-then-else statement,
(vi) While e s, the loop statement with condition e, (vii) SGhost and SFence, as
stubs for the corresponding memory instructions.

The key idea of the semantics is the following: expressions are evaluated
by issuing instructions to the memory system; then the program waits until
the memory system has made all necessary results available in the temporaries,
which allows the program to make another step. Figure 8 defines expression
evaluation. The function used-tmps e calculates the number of temporaries that



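\begin{align*}
\textsf{issue-expr}\ t\ (\mathsf{Const}\ v) &= [] \\
\textsf{issue-expr}\ t\ (\mathsf{Mem}\ volatile\ a) &= [\mathsf{Read}\ volatile\ a\ t] \\
\textsf{issue-expr}\ t\ (\mathsf{Tmp}\ (D, f)) &= [] \\
\textsf{issue-expr}\ t\ (\mathsf{Unop}\ f\ e) &= \textsf{issue-expr}\ t\ e \\
\textsf{issue-expr}\ t\ (\mathsf{Binop}\ f\ e_1\ e_2) &= \textsf{issue-expr}\ t\ e_1\ @\ \textsf{issue-expr}\ (t + \textsf{used-tmps}\ e_1)\ e_2 \\[1ex]
\textsf{eval-expr}\ t\ (\mathsf{Const}\ v) &= (\emptyset,\ \lambda\vartheta.\ v) \\
\textsf{eval-expr}\ t\ (\mathsf{Mem}\ volatile\ a) &= (\{t\},\ \lambda\vartheta.\ \textsf{the}\ (\vartheta\ t)) \\
\textsf{eval-expr}\ t\ (\mathsf{Tmp}\ (D, f)) &= (D, f) \\
\textsf{eval-expr}\ t\ (\mathsf{Unop}\ f\ e) &= \textsf{let}\ (D, f_e) = \textsf{eval-expr}\ t\ e\ \textsf{in}\ (D,\ \lambda\vartheta.\ f\ (f_e\ \vartheta)) \\
\textsf{eval-expr}\ t\ (\mathsf{Binop}\ f\ e_1\ e_2) &= \textsf{let}\ (D_1, f_1) = \textsf{eval-expr}\ t\ e_1; \\
&\phantom{{}= \textsf{let}\ } (D_2, f_2) = \textsf{eval-expr}\ (t + \textsf{used-tmps}\ e_1)\ e_2 \\
&\phantom{{}={}} \textsf{in}\ (D_1 \cup D_2,\ \lambda\vartheta.\ f\ (f_1\ \vartheta)\ (f_2\ \vartheta))
\end{align*}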



Fig. 8: Expression evaluation 



are necessary to evaluate expression e, where every Mem expression accounts for
one temporary. With issue-expr t e we obtain the instruction list for expression e
starting at temporary t, whereas eval-expr t e constructs the operation as a pair
of the domain and a function on the temporaries.
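The three functions of Figure 8 translate almost directly into a functional-style Python sketch. The encoding is an assumption made for illustration: expressions are frozen dataclasses, temporaries are a dict from temporary index to value, read instructions are tuples tagged "Read", and binary functions take two arguments instead of being curried. That `used_tmps` returns 0 for constants and Tmp expressions is likewise an assumption, consistent with every Mem expression accounting for one temporary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Union

Tmps = Dict[int, int]


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Mem:
    volatile: bool
    addr: int


@dataclass(frozen=True)
class Tmp:
    dom: FrozenSet[int]
    f: Callable[[Tmps], int]


@dataclass(frozen=True)
class Unop:
    f: Callable[[int], int]
    e: "Expr"


@dataclass(frozen=True)
class Binop:
    f: Callable[[int, int], int]
    e1: "Expr"
    e2: "Expr"


Expr = Union[Const, Mem, Tmp, Unop, Binop]


def used_tmps(e: Expr) -> int:
    """Number of temporaries needed: one per memory lookup."""
    if isinstance(e, Mem):
        return 1
    if isinstance(e, Unop):
        return used_tmps(e.e)
    if isinstance(e, Binop):
        return used_tmps(e.e1) + used_tmps(e.e2)
    return 0


def issue_expr(t: int, e: Expr) -> list:
    """Read instructions for e, allocating temporaries from t upwards."""
    if isinstance(e, Mem):
        return [("Read", e.volatile, e.addr, t)]
    if isinstance(e, Unop):
        return issue_expr(t, e.e)
    if isinstance(e, Binop):
        return issue_expr(t, e.e1) + issue_expr(t + used_tmps(e.e1), e.e2)
    return []


def eval_expr(t: int, e: Expr):
    """Operation (domain, function on temporaries) that computes e."""
    if isinstance(e, Const):
        return frozenset(), lambda tmps: e.value
    if isinstance(e, Mem):
        return frozenset({t}), lambda tmps: tmps[t]
    if isinstance(e, Tmp):
        return e.dom, e.f
    if isinstance(e, Unop):
        dom, fe = eval_expr(t, e.e)
        return dom, lambda tmps: e.f(fe(tmps))
    d1, f1 = eval_expr(t, e.e1)
    d2, f2 = eval_expr(t + used_tmps(e.e1), e.e2)
    return d1 | d2, lambda tmps: e.f(f1(tmps), f2(tmps))
```

For a binary operation over two memory lookups starting at temporary 3, `issue_expr` yields reads into temporaries 3 and 4, and `eval_expr` returns the domain {3, 4} together with a function that can be applied once both temporaries are defined.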






The program transitions are defined in Figure 9. We instantiate the program
state by a tuple (s, t) containing the statement s and the temporary counter t.
To assign an expression e to an address (expression) a we first create the
memory instructions for evaluating the address a and transform the address
expression to an operation on temporaries. The temporary counter is incremented
accordingly. When the value is available in the temporaries we continue by
creating the memory instructions for the evaluation of expression e, followed by
the corresponding store operation. Note that the ownership annotations can depend
on the temporaries and thus can take the calculated address into account.

Execution of compare and swap CAS involves the evaluation of three expressions:
the address a, the compare value $c_e$ and the swap value $s_e$. It is finally
mapped to the read-modify-write instruction RMW of the memory system. Recall that
execution of RMW first stores the memory content at address a to the specified
temporary. The condition compares this value with the result of evaluating $c_e$,
and the swap value $s_e$ is written if the comparison succeeds. In either case the
temporary finally returns the old value read.
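How compare and swap reduces to the generic read-modify-write instruction can be sketched in Python. The sketch leaves out ghost state and store buffers and uses dicts for memory and temporaries; the functions `rmw` and `cas` and their parameter names are hypothetical. The choice of `ret` as the function returning its first (old) argument follows the description above that the temporary holds the old value in either case.

```python
def rmw(mem, tmps, addr, t, f, cond, ret):
    """Read-modify-write: read into t, test cond, store f's result on success."""
    old = mem[addr]
    probe = {**tmps, t: old}            # temporaries extended with the value read
    if not cond(probe):
        return mem, probe               # failed: the read is the final result
    new = f(probe)
    return {**mem, addr: new}, {**tmps, t: ret(old, new)}


def cas(mem, tmps, addr, t, compare, swap):
    """Compare and swap expressed as an RMW with a specific cond and ret."""
    cond = lambda th: th[t] == compare(th)   # value read equals the compare value
    ret = lambda old, new: old               # the temporary returns the old value
    return rmw(mem, tmps, addr, t, swap, cond, ret)
```

With memory {0: 5}, a `cas` on address 0 comparing against 5 and swapping in 9 updates memory to {0: 9}; comparing against 4 leaves memory unchanged. In both cases temporary `t` ends up holding 5.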

Sequential composition is straightforward. An if-then-else is computed by first
issuing the memory instructions for evaluation of condition e and transforming
the condition to an operation on temporaries. When the result is available the
transition to the first or second statement is made, depending on the result of
isTrue. Execution of the loop is defined by stepwise unfolding. Ghost and fence
statements are just propagated to the memory system. To instantiate Theorem 1
with PIMP we define the invariant parameter valid, which has to be maintained
by all transitions of PIMP, the memory system and the store buffer. For every
thread configuration
$ts_{sb}[i] = ((s, t), is, \vartheta, sb, \mathcal{D}, \mathcal{O}, \mathcal{A})$
with $i < |ts_{sb}|$, where $\vartheta$ is the valuation of temporaries in the
current configuration, we require:
(i) The domain of all intermediate Tmp (D, f) expressions in statement s is
below counter t. (ii) All temporaries in the memory system including the store
buffer are below counter t. (iii) All temporaries less than counter t are either
already defined in the temporaries $\vartheta$ or are outstanding read temporaries
in the memory system.

For the PIMP transitions we prove these invariants by rule induction on 
the semantics. For the memory system (including the store buffer steps) the 
invariants are straightforward. The memory system does not alter the program 
state and does not create new temporaries, only the PIMP transitions create 
new ones in strictly ascending order. 
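The control-flow rules of Figure 9 (sequential composition, the conditional on an already evaluated condition, and loop unfolding) can be sketched as a small step function in Python. The encoding is an assumption for illustration only: statements are tagged tuples, a condition is already given as a Tmp operation (a pair of a domain and a function on temporaries), isTrue is approximated by Python truthiness, and issuing memory instructions and updating the temporary counter are left out.

```python
SKIP = ("Skip",)


def step(stmt, tmps):
    """One control-flow step; returns the successor statement or None if stuck."""
    kind = stmt[0]
    if kind == "While":
        # While e s  ->  Cond e (Seq s (While e s)) Skip
        _, cond, body = stmt
        return ("Cond", cond, ("Seq", body, stmt), SKIP)
    if kind == "Seq":
        _, s1, s2 = stmt
        if s1 == SKIP:
            return s2
        s1_next = step(s1, tmps)
        return None if s1_next is None else ("Seq", s1_next, s2)
    if kind == "Cond":
        _, (dom, f), s1, s2 = stmt
        if not dom <= set(tmps):
            return None  # result not yet available: wait for the memory system
        return s1 if f(tmps) else s2
    return None  # Skip and unmodeled statements make no control-flow step
```

The `None` result in the Cond case models the situation where the program has to wait until the memory system has made the temporaries in the condition's domain available.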

5 Conclusion 

We have presented a practical and flexible concurrent programming discipline
that ensures sequential consistency on TSO machines, such as current x64
architectures. Our approach covers a wide variety of concurrency control:
locking, data races, single writer multiple readers, read-only and thread-local
portions of memory. We minimize the need for store buffer flushes to optimize
the usage of the hardware. Our theorem is not coupled to a specific logical
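\[
\frac{\forall sop.\ a \neq \mathsf{Tmp}\ sop \qquad a' = \mathsf{Tmp}\ (\textsf{eval-expr}\ t\ a) \qquad t' = t + \textsf{used-tmps}\ a \qquad is = \textsf{issue-expr}\ t\ a}
     {\vartheta \vdash (\mathsf{Assign}\ volatile\ a\ e\ A\ L\ R\ W,\ t) \rightarrow_p ((\mathsf{Assign}\ volatile\ a'\ e\ A\ L\ R\ W,\ t'),\ is)}
\]

\[
\frac{D \subseteq \textsf{dom}\ \vartheta \qquad is = \textsf{issue-expr}\ t\ e\ @\ [\mathsf{Write}\ volatile\ (a\ \vartheta)\ (\textsf{eval-expr}\ t\ e)\ (A\ \vartheta)\ (L\ \vartheta)\ (R\ \vartheta)\ (W\ \vartheta)]}
     {\vartheta \vdash (\mathsf{Assign}\ volatile\ (\mathsf{Tmp}\ (D, a))\ e\ A\ L\ R\ W,\ t) \rightarrow_p ((\mathsf{Skip},\ t + \textsf{used-tmps}\ e),\ is)}
\]

\[
\frac{\forall sop.\ a \neq \mathsf{Tmp}\ sop \qquad a' = \mathsf{Tmp}\ (\textsf{eval-expr}\ t\ a) \qquad t' = t + \textsf{used-tmps}\ a \qquad is = \textsf{issue-expr}\ t\ a}
     {\vartheta \vdash (\mathsf{CAS}\ a\ c_e\ s_e\ A\ L\ R\ W,\ t) \rightarrow_p ((\mathsf{CAS}\ a'\ c_e\ s_e\ A\ L\ R\ W,\ t'),\ is)}
\]

\[
\frac{\forall sop.\ c_e \neq \mathsf{Tmp}\ sop \qquad c_e' = \mathsf{Tmp}\ (\textsf{eval-expr}\ t\ c_e) \qquad t' = t + \textsf{used-tmps}\ c_e \qquad is = \textsf{issue-expr}\ t\ c_e}
     {\vartheta \vdash (\mathsf{CAS}\ (\mathsf{Tmp}\ a)\ c_e\ s_e\ A\ L\ R\ W,\ t) \rightarrow_p ((\mathsf{CAS}\ (\mathsf{Tmp}\ a)\ c_e'\ s_e\ A\ L\ R\ W,\ t'),\ is)}
\]

\[
\frac{\begin{array}{c}
D_a \subseteq \textsf{dom}\ \vartheta \qquad D_c \subseteq \textsf{dom}\ \vartheta \qquad \textsf{eval-expr}\ t\ s_e = (D, f) \\
t' = t + \textsf{used-tmps}\ s_e \qquad cond = (\lambda\vartheta.\ \textsf{the}\ (\vartheta\ t') = c\ \vartheta) \qquad ret = (\lambda v_1\ v_2.\ v_1) \\
is = \textsf{issue-expr}\ t\ s_e\ @\ [\mathsf{RMW}\ (a\ \vartheta)\ t'\ (D, f)\ cond\ ret\ (A\ \vartheta)\ (L\ \vartheta)\ (R\ \vartheta)\ (W\ \vartheta)]
\end{array}}
     {\vartheta \vdash (\mathsf{CAS}\ (\mathsf{Tmp}\ (D_a, a))\ (\mathsf{Tmp}\ (D_c, c))\ s_e\ A\ L\ R\ W,\ t) \rightarrow_p ((\mathsf{Skip},\ \mathsf{Suc}\ t'),\ is)}
\]

\[
\frac{\vartheta \vdash (s_1, t) \rightarrow_p ((s_1', t'),\ is)}
     {\vartheta \vdash (\mathsf{Seq}\ s_1\ s_2,\ t) \rightarrow_p ((\mathsf{Seq}\ s_1'\ s_2,\ t'),\ is)}
\qquad
\vartheta \vdash (\mathsf{Seq}\ \mathsf{Skip}\ s_2,\ t) \rightarrow_p ((s_2, t),\ [])
\]

\[
\frac{\forall sop.\ e \neq \mathsf{Tmp}\ sop \qquad e' = \mathsf{Tmp}\ (\textsf{eval-expr}\ t\ e) \qquad t' = t + \textsf{used-tmps}\ e \qquad is = \textsf{issue-expr}\ t\ e}
     {\vartheta \vdash (\mathsf{Cond}\ e\ s_1\ s_2,\ t) \rightarrow_p ((\mathsf{Cond}\ e'\ s_1\ s_2,\ t'),\ is)}
\]

\[
\frac{D \subseteq \textsf{dom}\ \vartheta \qquad \textsf{isTrue}\ (e\ \vartheta)}
     {\vartheta \vdash (\mathsf{Cond}\ (\mathsf{Tmp}\ (D, e))\ s_1\ s_2,\ t) \rightarrow_p ((s_1, t),\ [])}
\qquad
\frac{D \subseteq \textsf{dom}\ \vartheta \qquad \lnot\, \textsf{isTrue}\ (e\ \vartheta)}
     {\vartheta \vdash (\mathsf{Cond}\ (\mathsf{Tmp}\ (D, e))\ s_1\ s_2,\ t) \rightarrow_p ((s_2, t),\ [])}
\]

\[
\vartheta \vdash (\mathsf{While}\ e\ s,\ t) \rightarrow_p ((\mathsf{Cond}\ e\ (\mathsf{Seq}\ s\ (\mathsf{While}\ e\ s))\ \mathsf{Skip},\ t),\ [])
\]

\[
\vartheta \vdash (\mathsf{SGhost}\ A\ L,\ t) \rightarrow_p ((\mathsf{Skip}, t),\ [\mathsf{Ghost}\ (A\ \vartheta)\ (L\ \vartheta)])
\]

\[
\vartheta \vdash (\mathsf{SFence},\ t) \rightarrow_p ((\mathsf{Skip}, t),\ [\mathsf{Fence}])
\]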



Fig. 9: Program transitions 






framework like separation logic but is based on more fundamental arguments, 
namely the adherence to the access and flushing policy which can be discharged 
within any program logic. 

Related work. A categorization of various weak memory models is presented
in [2]. It is compatible with the recent revisions of the Intel manuals [8] and the
revised x86 model presented in [11]. The state of the art in formal verification of
concurrent programs is still based on a sequentially consistent memory model.
To justify this on a weak memory model, often a quite drastic approach is chosen,
allowing only coarse-grained concurrency, usually implemented by locking.
Thereby data races are ruled out completely, and there are results that
data-race-free programs can be considered sequentially consistent, for example
for the Java memory model [3,14] or the x86 memory model [11]. Ridge [13]
considers weak memory and data races and verifies Peterson's mutual exclusion
algorithm. He ensures sequential consistency by flushing after every write to
shared memory. Burckhardt and Musuvathi [6] describe an execution monitor that
efficiently checks whether a sequentially consistent TSO execution has a
single-step extension that is not sequentially consistent. Like our approach, it
avoids having to consider the store buffers as an explicit part of the state.
However, their condition requires maintaining in ghost state enough history
information to determine causality between events, which means maintaining a
vector clock (which is itself unbounded) for each memory address. Moreover,
causality (being essentially graph reachability) is already not first-order, and
hence unsuitable for many types of program verification.

Future work. We currently have an asymmetry in the ghost operations for own- 
ership transfer: whereas we have a 'free flowing' ghost operation to acquire an 
address which can appear anywhere, releases are delayed to the next (volatile 
or interlocked) write operation. As sketched in Section 4.4 delaying releases is 
motivated by our simulation proof. To rule out certain races we argue that the
state of the virtual machine can only be more restrictive than the state of the
store buffer machine. As the virtual state is obtained by flushing all store
buffers until
the first volatile write is hit (excluding it) a release in that flushed section would 
violate this invariant. However, we believe this does not restrict expressibility as 
the only way for other threads to gain knowledge about the released address is 
via the next volatile write operation of the thread. Formally we want to liberate 
the points where an ownership release can happen by introducing free flowing 
releases. The informal key argument is that an unsafe execution with delayed 
releases implies an unsafe execution with free flowing releases. Between the re- 
lease point in the free flowing execution and the delayed release point (at the 
next volatile write) a thread only executes commuting operations with respect 
to bad races. Such a race at the delayed point already justifies a race at the 
release point in the free flowing execution. 
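
The asymmetry between acquires and releases described above can be made concrete with a small ghost-state sketch. This is only an illustration under simplifying assumptions, not the Isabelle/HOL formalization: the class GhostThread, its methods, and the shared owner map are hypothetical names. An acquire takes effect immediately, while a release is merely recorded and only takes effect at the thread's next volatile write.

```python
class GhostThread:
    """Toy ghost ownership state for one thread (hypothetical model)."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner              # shared map: address -> owning thread name
        self.pending_releases = set()   # releases waiting for the next volatile write

    def acquire(self, addr):
        # 'Free flowing': may appear anywhere and takes effect immediately.
        current = self.owner.get(addr)
        if current is not None and current != self.name:
            raise RuntimeError(f"{addr} is owned by {current}")
        self.owner[addr] = self.name

    def release(self, addr):
        # Delayed: only recorded here, ownership is kept for now.
        if self.owner.get(addr) != self.name:
            raise RuntimeError(f"{self.name} does not own {addr}")
        self.pending_releases.add(addr)

    def volatile_write(self, memory, addr, value):
        # The write is the point at which recorded releases become visible.
        memory[addr] = value
        for released in self.pending_releases:
            self.owner.pop(released, None)
        self.pending_releases.clear()


owner, memory = {}, {}
t0, t1 = GhostThread("t0", owner), GhostThread("t1", owner)

t0.acquire("x")
t0.release("x")                       # recorded, not yet effective
still_owned = owner.get("x") == "t0"  # t0 still owns x at this point

t0.volatile_write(memory, "flag", 1)  # release takes effect here
t1.acquire("x")                       # now permitted
```

A free flowing release would instead remove the address from the owner map inside release itself; the argument sketched above is that any bad race observable at the delayed point can already be observed at that earlier point.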

Another direction of future work is to take compiler optimization into ac- 
count. Our volatile accesses correspond roughly to volatile memory accesses 
within a C program. An optimizing compiler is free to convert any sequence 
of non-volatile accesses into a (sequentially semantically equivalent) sequence 
of accesses. As long as execution is sequentially consistent, equivalence of these 
programs (e.g., with respect to final states of executions that end with volatile 
operations) follows immediately by reduction. However, some compilers are a 
little more lenient in their optimizations, and allow operations on certain local 
variables to move across volatile operations. In the context of C (where stack
variables can be passed by pointer), the notion of "locality" is somewhat 
tricky, and makes essential use of C forbidding (semantically) address arithmetic 
across memory objects. 

Finally, we should note that there are important programs that, in the pres- 
ence of store buffers, are correct but not sequentially consistent. A typical ex- 
ample is the following simplified form of barrier synchronization: each processor 
has a flag that it writes and other processors read, and each processor waits for 
all processors to set their flags before continuing past the barrier. This is not 
sequentially consistent - each processor might see its own flag set and later see 
all other flags clear - but it is still correct. One possibility is to give a more gen- 
eral reduction theorem that allows each processor to always treat store buffers 
of other processors as empty, and its own store buffer as empty except for brief 
periods of time. 
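
The barrier behaviour described above can be reproduced in a few lines. The following is a minimal sketch of one particular execution on a toy store buffer machine; the class TsoMachine, the function snapshot, and the address names flag0 and flag1 are assumptions made for the illustration and are not part of the formal model.

```python
class TsoMachine:
    """Toy TSO machine: one FIFO store buffer per processor, with forwarding."""

    def __init__(self, n_procs):
        self.memory = {}
        self.buffers = [[] for _ in range(n_procs)]

    def write(self, proc, addr, value):
        # Stores go to the processor's own buffer, not directly to memory.
        self.buffers[proc].append((addr, value))

    def read(self, proc, addr):
        # Forwarding: the newest pending store of this processor wins.
        for a, v in reversed(self.buffers[proc]):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self, proc):
        # Commit all pending stores of this processor in FIFO order.
        for a, v in self.buffers[proc]:
            self.memory[a] = v
        self.buffers[proc].clear()


def snapshot(machine, proc, n_procs):
    """Read all barrier flags as seen by processor `proc`."""
    return [machine.read(proc, f"flag{q}") for q in range(n_procs)]


n = 2
m = TsoMachine(n)

# Both processors arrive at the barrier and set their own flag.
m.write(0, "flag0", 1)
m.write(1, "flag1", 1)

# Each processor sees its own flag set but the other flag still clear.
view0 = snapshot(m, 0, n)   # [1, 0]
view1 = snapshot(m, 1, n)   # [0, 1]

# Once the buffers drain, both processors see all flags and may proceed.
m.drain(0)
m.drain(1)
final0 = snapshot(m, 0, n)  # [1, 1]
final1 = snapshot(m, 1, n)  # [1, 1]
```

No sequentially consistent interleaving yields both view0 and view1, since the processor whose flag is written second would have to see the other flag set. The barrier is nevertheless correct in this execution: a processor proceeds only after observing every flag set, which can happen only after every processor has arrived.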